-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36036: [C++][Python][Parquet] Implement Float16 logical type #36073
Conversation
47790ee
to
2809fbf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @benibus ! Please note that the Thrift definitions will have to be synchronized from https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift once the Parquet format spec addition is accepted.
Thanks! Before moving forward on some of the suggestions, I just want to ensure that I interpreted the endianness requirement from the proposal correctly. As per the spec:
Is this in line with what I've done here (i.e. the |
Yes! |
@@ -435,6 +435,8 @@ std::shared_ptr<const LogicalType> LogicalType::FromThrift( | |||
return BSONLogicalType::Make(); | |||
} else if (type.__isset.UUID) { | |||
return UUIDLogicalType::Make(); | |||
} else if (type.__isset.FLOAT16) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need to change thrift_internal.h for serialization of FLOAT16.
5c109d3
to
2f9ca2a
Compare
By the way:
Do you plan to tackle Arrow integration in a subsequent PR? This PR cannot be integrated before the proposal is approved, AFAICT. |
Yes, my plan was to handle Arrow integration in a subsequent PR (in tandem with the Java implementation), and then move forward with the proposal. However, I think you're right that we'll need to roll those features into this PR. Just did a closer re-reading of @julienledem's comments on apache/parquet-format#184, and I believe we're going to want the full C++/Java PRs ready (but not merged) before approving the proposal - so we'll probably need to sit on this one in the meantime. That being said, I can maintain this PR until then. Plus we may need to make adjustments if new requirements come in from the Java side. |
Well, if Arrow integration does not come in this PR, then at least some testing of reading/writing using non-Arrow Parquet APIs should still be added. |
Just curious: Is it enough to get approval if there is a full C++ implementation only? Or java parity should be ready at the meanwhile? |
We probably want a Java implementation as well according to apache/parquet-format#184 (comment) |
OK, if that is a must. I can spare some time to do this. Would be good after this PR gets completed. |
Reverted several prior changes that were accidentally pushed Enabled construction from native floats Removed `uint16_t` conversion operator since it doesn't behave consistently with standard floats. As a result, rolled back some of the prior changes to `random_real` used in the Parquet test utils
a672c73
to
157e0d7
Compare
### Rationale for this change There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (#36073), so we should add one for Go as well. ### What changes are included in this PR? - [x] Adds `LogicalType` definitions and methods for `Float16` - [x] Adds support for `Float16` column statistics and comparators - [x] Adds support for interchange between Parquet and Arrow's half-precision float ### Are these changes tested? Yes ### Are there any user-facing changes? Yes * Closes: #37582 Authored-by: benibus <[email protected]> Signed-off-by: Matt Topol <[email protected]>
Now that the proposal is merged, would anyone be willing to take a final look and potentially merge this? I've gone ahead and rebased for good measure. I should also mention that the Go implementation was merged today, but it'll eventually need this PR's |
…37599) ### Rationale for this change There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (apache#36073), so we should add one for Go as well. ### What changes are included in this PR? - [x] Adds `LogicalType` definitions and methods for `Float16` - [x] Adds support for `Float16` column statistics and comparators - [x] Adds support for interchange between Parquet and Arrow's half-precision float ### Are these changes tested? Yes ### Are there any user-facing changes? Yes * Closes: apache#37582 Authored-by: benibus <[email protected]> Signed-off-by: Matt Topol <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for the nits above.
Would it be useful to add tests checking that one write or read a Parquet FLOAT16 with a byte length != 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
( It's a bit weird for me that fp16 can only use PLAIN, RLE_DICT and ... DELTA_BYTE_ARRAY, aha...)
Probably... I'll try to add something, assuming it wouldn't be redundant.
Yeah, I think we're going to specifically address the encodings as a follow-up since there's been some recent discussion in that area. |
Update: I don't think it'd be very useful to add tests for different type lengths at the reader/writer level since it isn't even possible to construct a However, it turns out that I forgot to add the relevant tests to |
LGTM, I'm waiting for a last CI check and will merge afterwards. |
Also we're going to release this in parquet-2.10 |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit b55d13c. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…37599) ### Rationale for this change There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (apache#36073), so we should add one for Go as well. ### What changes are included in this PR? - [x] Adds `LogicalType` definitions and methods for `Float16` - [x] Adds support for `Float16` column statistics and comparators - [x] Adds support for interchange between Parquet and Arrow's half-precision float ### Are these changes tested? Yes ### Are there any user-facing changes? Yes * Closes: apache#37582 Authored-by: benibus <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…apache#36073) ### Rationale for this change There is currently an active proposal to support half-float types in Parquet. For more details/discussion, see the links in this PR's accompanying issue. ### What changes are included in this PR? This PR implements basic support for a `Float16LogicalType` in accordance with the proposed spec. More specifically, this includes: - Changes to `parquet.thrift` and regenerated `parqet_types` files - Basic `LogicalType` class definition, method impls, and enums - Support for specialized comparisons and column statistics In the interest of scope, this PR does not currently deal with arrow integration and byte split encoding - although we will want both of these features resolved before the proposal is approved. ### Are these changes tested? Yes (tests are included) ### Are there any user-facing changes? Yes * Closes: apache#36036 Lead-authored-by: benibus <[email protected]> Co-authored-by: Ben Harkins <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (apache/arrow#36073), so we should add one for Go as well. ### What changes are included in this PR? - [x] Adds `LogicalType` definitions and methods for `Float16` - [x] Adds support for `Float16` column statistics and comparators - [x] Adds support for interchange between Parquet and Arrow's half-precision float ### Are these changes tested? Yes ### Are there any user-facing changes? Yes * Closes: #37582 Authored-by: benibus <[email protected]> Signed-off-by: Matt Topol <[email protected]>
Rationale for this change
There is currently an active proposal to support half-float types in Parquet. For more details/discussion, see the links in this PR's accompanying issue.
What changes are included in this PR?
This PR implements basic support for a
Float16LogicalType
in accordance with the proposed spec. More specifically, this includes:parquet.thrift
and regeneratedparqet_types
filesLogicalType
class definition, method impls, and enumsIn the interest of scope, this PR does not currently deal with arrow integration and byte split encoding - although we will want both of these features resolved before the proposal is approved.
Are these changes tested?
Yes (tests are included)
Are there any user-facing changes?
Yes